Keywords: Internet of things; Jetson Nano; Nvidia; You only look once

[...] COVID-19. The main challenges are the image datasets, which are unstructured and may grow
large, affecting the accuracy and speed of the detection. The second challenge is the portability
of the detection devices, which are generally dependent on the more portable like NVIDIA Jetson
Nano or from [...] to design and implement real-time face mask wearing detection using the
pretrained dataset as well as the real-time data.

1. INTRODUCTION

Deep learning has attracted significant attention in a wide range of applications, such as machine vision [1]-[4].
It attempts to discover the target visual features from randomly generated image sources. Examples of
practical deployments include facial recognition [5]-[7], motion detection [8]-[11], image classification [12],
[13], and vehicle detection [14]-[17]. Deep learning creates new opportunities for the development of
intelligent interactions between people and their devices or technology. As a result of the current global
pandemic, individuals are mandated to wear facial masks, which makes it challenging to inspect individuals
wearing masks in settings such as public or open spaces. Because healthcare protocols such as the wearing of
facial masks and social distancing must be followed, face recognition under these conditions has become a
topic that is attracting considerable attention [18].

Currently, different approaches have been devised for detecting facial masks based on deep learning
[19], such as the region proposal network (RPN) [20], [21] and the faster region-based convolutional neural
network (R-CNN) methods [22]-[24]. However, the detection speed of these methods is relatively slow,
especially when they are implemented on low-power processing units like the NVIDIA Jetson Nano. The you
only look once (YOLO) algorithm [23], [25]-[27] is a more recent method based on deep learning, and it is an
enhanced and faster alternative to the traditional approach. It has been reported that the efficacy of YOLO
is ten times higher than that of faster R-CNN. This is of the utmost importance when the classification
software has to be installed on a system with limited central processing unit (CPU) power and relatively low
computational capacity.

NVIDIA's Jetson is a feasible enabler for the introductory phase of machine and computer vision due
to its low-power processing capability, and it is at the forefront of artificial-intelligence hardware
development for computer vision [28]-[30]. The CPU-graphics processing unit (GPU) architecture of the Jetson
Nano [31], [32] enables the CPU to load faster while the GPU seamlessly runs the machine-learning
techniques. It has a sleek design, is portable, and uses little power, making it well suited to application
domains with constraints on weight and power. Because of its improved processing time and accuracy,
YOLOv5 is expected to perform image detection tasks such as facial mask wearing detection and counting
feasibly on the Jetson Nano. Therefore, this paper presents a YOLOv5-based algorithm for face mask wearing
detection and counting using both the NVIDIA GTX and Jetson Nano platforms to address these healthcare
and monitoring issues.


2. METHOD 

This section describes the step-by-step methodology employed in the study. There are four main steps 
employed in this paper, which are the development of the deep learning model, the data creation, the model 
training and the model inferencing. In the next subsection, the employed deep learning model will be presented. 


2.1. Deep learning model 

A one-stage algorithm based on YOLOv5, which is considerably fast in processing detection and
prediction, is used. The algorithm has a unique characteristic in that it reframes object detection as a
regression problem, so that it can be computed at a high rate [1], [2]. This is essential to ensure fast
detection performance on a standalone platform like the Jetson Nano, which has limited processing
capacity.
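
To illustrate what treating detection as a regression problem means in practice, the sketch below decodes one set of raw network outputs into a bounding box. The decoding rule is the one commonly described for the public YOLOv5 implementation and should be treated as an assumption here; the function names, arguments and example numbers are hypothetical and are not taken from this study.

```python
import math


def sigmoid(x):
    """Logistic function used to squash raw network outputs into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, stride, anchor_w, anchor_h):
    """Turn raw regression outputs into a box (centre x, centre y, width, height) in pixels.

    Assumed YOLOv5-style decoding (illustrative, not confirmed by the paper):
      centre = (2 * sigmoid(t) - 0.5 + grid cell index) * stride
      size   = (2 * sigmoid(t)) ** 2 * anchor size
    """
    cx = (2.0 * sigmoid(tx) - 0.5 + cell_x) * stride
    cy = (2.0 * sigmoid(ty) - 0.5 + cell_y) * stride
    w = (2.0 * sigmoid(tw)) ** 2 * anchor_w
    h = (2.0 * sigmoid(th)) ** 2 * anchor_h
    return cx, cy, w, h


if __name__ == "__main__":
    # Hypothetical example: grid cell (3, 4) on a stride-8 feature map with a 10x20 anchor.
    print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 8, 10.0, 20.0))
```

Because the box is obtained directly from a handful of regressed numbers per grid cell, no separate region proposal stage is needed, which is what keeps the per-frame cost low.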

The one-stage architecture of YOLOv5 consists of three main components: the backbone, the neck and
the head. The backbone (CSPDarknet) is a convolutional neural network (CNN) that performs feature
extraction by collecting and shaping the image features at various granularities. It utilizes the cross
stage partial (CSP) network technique to produce the image features. Next, the features are forwarded to
the neck (PANet) stage for feature fusion, where the image features are combined. Finally, the combined
features are fed to the head (YOLO layer) for prediction and classification. Figure 1 illustrates the
YOLOv5 architecture. In this paper, a recent deep model based on YOLOv5 [23] is proposed and implemented
on a Jetson Nano [1] for facial mask wearing detection and counting. The remaining steps of the method are
presented in the following subsections, before the results and conclusion are presented in sections 3 and
4, respectively.


2.2. Data creation 

To train the model, a dataset of images was created using public images of face masks. The images
were divided into three categories: i) with a face mask, ii) without a face mask, and iii) masks worn
incorrectly. The dataset included 848 images. A data augmentation technique was then applied to generate
more images and to add some noise to the images, to ensure that the proposed model is robust against noise.
As a result, a total of 2034 images with and without noise were generated. All the images were annotated
in the YOLOv5 format for training with the PyTorch version of YOLOv5. For training, 87% of the dataset was
allocated, with 8% for validation and 4% for testing. The framework of the detection is shown in Figure 2.
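
As a minimal sketch of the annotation and splitting steps described above, the code below converts a pixel-space bounding box into a YOLO-style label line (class index followed by the normalised centre, width and height) and splits a list of images by the stated fractions. The helper names, the example box and the random seed are assumptions made for illustration. Since the stated percentages sum to 99%, this sketch simply assigns whatever remains after the training and validation splits to the test split.

```python
import random


def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Return 'class cx cy w h' with the box normalised to the image size."""
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_dataset(items, train_frac=0.87, val_frac=0.08, seed=0):
    """Shuffle and split items into (train, validation, test) lists.

    The test split receives everything left after train and validation.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical 400x500 image with one box of class 0.
    print(to_yolo_label(0, 100, 50, 300, 250, 400, 500))
    train, val, test = split_dataset(range(2034))
    print(len(train), len(val), len(test))
```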


2.3. Model training 

Two YOLOv5 models were trained, one with and one without a pretrained model. Both models were
trained on the NVIDIA GTX 1660 6 GB GPU for 100 epochs. From the results obtained, both models achieved
acceptable and satisfactory accuracy values in the training.


2.4. Model inferencing

The trained models were tested in two different testing modes with distinct computational power.
First, the models were tested using an NVIDIA GTX 1660 6 GB. For the second test, an embedded system
using a Jetson Nano 4 GB board was used.
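
One simple way to compare the two testing modes is to time the same inference callable over a batch of frames on each platform. The sketch below is an assumption about how such a measurement could be made, not the procedure used in this study; `measure_fps`, its parameters and the dummy inference function are hypothetical names.

```python
import time


def measure_fps(infer, frames, warmup=2, clock=time.perf_counter):
    """Return frames per second for `infer` over `frames`, skipping warm-up frames.

    `infer` is any callable taking one frame; `clock` is injectable for testing.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are excluded from the timing
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = clock()
    for frame in timed:
        infer(frame)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return len(timed) / elapsed


if __name__ == "__main__":
    def dummy_infer(frame):
        # Stand-in for a real detector call; sleeps to simulate work.
        time.sleep(0.01)

    print(f"{measure_fps(dummy_infer, range(12)):.1f} FPS")
```

Excluding the first few frames avoids counting one-off start-up costs, which can otherwise dominate short measurements on an embedded board.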

Figure 1. YOLOv5 architecture. CSP (cross stage partial network), SPP (spatial pyramid pooling), conv
(convolutional layer) and concat (concatenate function) sub-components


[Figure 2 labels: Dataset; Training; Detection: capture object (people) in the image; Counting: total number of people detected; Prediction: classification into category (1), (2) or (3); Training stage; Model stage; Output image]

Figure 2. Framework of the proposed model 


3. RESULTS AND DISCUSSION 

The proposed YOLOv5 model was trained in two configurations: with the pretrained model and without
it (from scratch). The dataset used for training and testing the proposed YOLOv5 model is the Kaggle face
mask detection dataset. Before delving deeper into YOLOv5, a comparison study was conducted between the
proposed YOLOv5 model and other current deep learning-based approaches for face mask recognition, the
results of which are displayed in Table 1. It is clear that the YOLOv5 model attained the highest mean
average precision (mAP) of the compared approaches. Furthermore, YOLOv5 performed the detection operation
at high frame rates: 120 frames per second (FPS) using the NVIDIA GTX 1660 6 GB and 20 FPS with the
Jetson Nano.


Table 1. Comparison between different deep learning-based methods for face mask detection
Face mask detection model     mAP     NVIDIA GTX 1660 6 GB     Jetson Nano
Centernet resnet50 v2         0.57    20 FPS                   -
Faster R-CNN resnet50 v1      0.59    8.3 FPS                  -
SSD mobilenet v1 fpn          0.61    15 FPS                   -
SSD resnet50 v1 fpn           0.57    12 FPS                   -
YOLOv5                        0.70    120 FPS                  20 FPS


In the training phase, various input images containing people with and without face masks from the
public Kaggle dataset were fed to the YOLOv5 algorithm. Some of the images in which face mask wearing was
detected are shown in Figure 3. It can be observed in this figure that the algorithm is capable of detecting
and counting the people wearing facial masks, regardless of the number of people, in most of the images
found in the dataset. The numbers of people detected wearing masks correctly in each of the images,
starting from the top left image in the figure, are: 4, 7, 0, 1, 1, 0, 4, 10, 3, 1, 1, 1, 1, 1, 2.
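
The counting step can be sketched as a tally of detections per class above a confidence threshold. The class names, their index order, the threshold of 0.5 and the example detections below are all assumptions made for illustration; the paper does not specify them.

```python
# Hypothetical class order; the actual label indices used in the study are not stated.
CLASS_NAMES = ("with_mask", "without_mask", "mask_worn_incorrectly")


def count_by_class(detections, conf_threshold=0.5):
    """Count detections per class, ignoring those below the confidence threshold.

    `detections` is an iterable of (class_id, confidence) pairs for one image.
    """
    counts = {name: 0 for name in CLASS_NAMES}
    for class_id, confidence in detections:
        if confidence >= conf_threshold:
            counts[CLASS_NAMES[class_id]] += 1
    return counts


if __name__ == "__main__":
    # Made-up detections for a single image.
    example = [(0, 0.91), (0, 0.84), (1, 0.72), (2, 0.31), (0, 0.44)]
    counts = count_by_class(example)
    print(counts, "total people:", sum(counts.values()))
```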

Figure 3 shows the confusion matrix of the model. The model mostly confuses the with mask class
with the mask worn incorrectly class, at a rate of 0.43. In other words, almost 43% of the people who wear
their masks incorrectly were mistakenly predicted as wearing them correctly. This is expected due to the
similarity between the two classes. Another reason is the small number of samples for the mask worn
incorrectly class. However, the model has no difficulty differentiating between the with mask and without
mask classes, as it shows a 0.02 confusion rate. In other words, only 2% of the people who are not actually
wearing masks were mistakenly predicted as wearing masks.

Figure 3. The confusion matrix

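
The confusion rates quoted above are fractions of each true class, so they can be read as a confusion matrix normalised per true-class column. The sketch below shows that normalisation for plain (true, predicted) class pairs; it is a simplified illustration that ignores the background class and box matching a detection confusion matrix also involves, and the example counts are invented to reproduce a 0.43 rate.

```python
def confusion_rates(pairs, num_classes):
    """Return rates[pred][true]: fraction of each true class predicted as each class."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in pairs:
        counts[pred_id][true_id] += 1
    rates = [[0.0] * num_classes for _ in range(num_classes)]
    for t in range(num_classes):
        column_total = sum(counts[p][t] for p in range(num_classes))
        for p in range(num_classes):
            # Leave the rate at 0.0 when a true class has no samples.
            if column_total:
                rates[p][t] = counts[p][t] / column_total
    return rates


if __name__ == "__main__":
    # Invented data: class 2 (mask worn incorrectly) has 100 samples,
    # 43 of which are predicted as class 0 (with mask).
    pairs = [(2, 0)] * 43 + [(2, 2)] * 57 + [(0, 0)] * 50
    rates = confusion_rates(pairs, num_classes=3)
    print(rates[0][2])
```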
Figure 4 shows the F1 score of the model over different confidence levels. As can be seen in the
figure, the model shows the highest F1 score and confidence level for the with mask class compared to the
other classes. The mask worn incorrectly class has the lowest F1 score over confidence, which agrees with
the confusion matrix result presented above.

Figure 4. F1 score
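
For reference, the F1 score is the harmonic mean of precision and recall, and a curve over confidence is obtained by recomputing it at each confidence threshold. The sketch below assumes each detection has already been marked as a true or false positive; the function name and the example detections are hypothetical.

```python
def f1_at_threshold(scored, num_ground_truth, threshold):
    """F1 score using only detections with confidence >= threshold.

    `scored` is a list of (confidence, is_true_positive) pairs for one class.
    """
    kept = [is_tp for confidence, is_tp in scored if confidence >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented detections for one class with 4 ground-truth objects.
    scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
    for threshold in (0.0, 0.5, 0.95):
        print(threshold, round(f1_at_threshold(scored, 4, threshold), 3))
```

Raising the threshold removes low-confidence detections, which tends to increase precision and reduce recall, so the F1 curve usually peaks at an intermediate confidence.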


Figure 5 shows the overall analysis result for the model, including the training loss, the box loss, the
recall and the precision (mAP) measurements. It can be seen that the model progresses well over the 100
epochs: the mAP value keeps increasing and the training loss keeps decreasing as the number of epochs
increases.

Figure 5. Various metrics measured over 100 epochs (the horizontal axis represents the number of epochs
and the vertical axis represents the corresponding loss, precision or recall values)
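
As a rough illustration of the mAP metric tracked above, the sketch below computes the average precision (AP) of one class as the area under its precision-recall curve and then averages the per-class values. It uses a simple all-point interpolation; the exact interpolation and the overlap threshold used by the YOLOv5 code base may differ, and the function names and example data are assumptions.

```python
def average_precision(scored, num_ground_truth):
    """Area under the precision-recall curve for one class.

    `scored` is a list of (confidence, is_true_positive) pairs.
    """
    if num_ground_truth == 0 or not scored:
        return 0.0
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


if __name__ == "__main__":
    # Invented detections for one class with 2 ground-truth objects.
    print(average_precision([(0.9, True), (0.8, False), (0.7, True)], 2))
```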


Figure 6 shows the detection results obtained when the model is tested without using the pretrained
model. Although most of the images shown in Figure 6 have been successfully detected as either with a mask,
without a mask, or a mask worn incorrectly, the overall performance is slightly lower than the result
obtained when the algorithm is tested using the pre-trained model. The reason behind the slight decrease is
straightforward: without the pre-trained model, the network has to learn all of its features from scratch
using only the relatively small dataset described in section 2.2.




Figure 6. Detection results using the pretrained model 


Figure 7 shows the F1 score obtained without the pre-trained model over different confidence levels.
As seen in the figure, the highest F1 score and confidence level are obtained for the with mask class
compared to the other classes. The mask worn incorrectly class has the lowest F1 score over confidence,
which the confusion matrix explains. It can be noticed that the F1 score for the mask worn incorrectly
class decreases markedly compared to the result obtained when using the pre-trained model.

Figure 7. F1 score without the pre-trained model

Finally, the overall metrics for the case without the pre-trained model are measured and plotted in
Figure 8. Although the performance is slightly lower than with the pre-trained model, it is still
acceptable, especially when the number of people to be detected in the image is not large. Therefore,
YOLOv5 is suitable for facial mask wearing detection and counting in closed areas or public buildings that
limit the number of people allowed inside, such as offices, schools, and religious places such as prayer
halls or mosques. The novelty highlighted in this paper is the ability of the proposed YOLOv5 model to
detect face masks worn by people while counting the number of those wearing masks correctly or otherwise.
With the relatively high detection accuracy achieved, this model is able to better calculate or estimate
the number of people wearing masks in the image under consideration.

Figure 8. Metrics measurement over 100 epochs without using the pre-trained model (the horizontal axis
represents the number of epochs and the vertical axis represents the corresponding loss, precision or
recall values)


4. CONCLUSION 

Based on the results obtained and presented in this paper, it can be concluded that YOLOv5 is useful
for detecting and counting people wearing facial masks. This is essential to control the spread of viruses,
especially when the building or area to be entered is closed and limited in space. By counting the number
of people wearing facial masks correctly, further action can be taken to stop people who are not wearing
masks from entering the building, while ensuring that the number of people wearing masks stays within the
maximum number of people allowed in the building.